Release 10.1A: OpenEdge Development:
Internationalizing Applications


Techniques for working with multi-byte characters

The following techniques might save you time and trouble:

Choosing the appropriate unit of measure

Several Progress 4GL elements, including the LENGTH function, OVERLAY statement, SUBSTRING function, and SUBSTRING statement, let you specify the unit of measure as the character, the byte, or the column. If you choose the wrong unit of measure, you might split or overlay a multi-byte character. Consider the following example:

DEFINE VARIABLE char-over AS CHARACTER FORMAT "X(8)".
char-over = "abcefg". /* a double-byte character (not shown) follows "abc" */
OVERLAY(char-over,1,4,"RAW") = "wxyz". /* RAW is wrong */
DISPLAY char-over WITH 1 COLUMN.

The example defines a character variable and sets it to a string of seven characters, the fourth of which is double byte. The example then overlays a string of four single-byte characters on the original string, starting at position one and continuing for four positions. Unfortunately, the unit of measure is the byte (specified by RAW), so the fourth byte of the second string, which is the character z, overlays the fourth byte of the original string, which is the lead byte of the double-byte character.

Figure 8–5 shows how the z in the second string overlays the lead byte of the double-byte character in the original string.

Figure 8–5: A single-byte character overlaying a lead byte

All that remains of the multi-byte character is the trail byte, as shown in Figure 8–6.

Figure 8–6: Result of a single-byte character overlaying a lead byte

To fix this error, change the unit of measure to CHARACTER:

DEFINE VARIABLE char-over AS CHARACTER FORMAT "X(8)".
char-over = "abcefg". /* a double-byte character (not shown) follows "abc" */
OVERLAY(char-over,1,4,"CHARACTER") = "wxyz". /* CHARACTER is correct */
DISPLAY char-over WITH 1 COLUMN.

The corrected program produces the string shown in Figure 8–7.

Figure 8–7: String produced by an OVERLAY statement whose unit of measure is the character
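
One way to catch this kind of error before it occurs is to check whether a byte count lands on a character boundary. The following is a minimal sketch rather than a documented technique: the function name endsOnCharBoundary and its parameters are hypothetical, it uses only LENGTH and SUBSTRING with explicit units, and it assumes the byte count is at least 1.

```abl
/* Hypothetical helper: returns YES when the first piBytes bytes of pcText
   end exactly on a character boundary. Assumes piBytes >= 1. */
FUNCTION endsOnCharBoundary RETURNS LOGICAL
    (INPUT pcText AS CHARACTER, INPUT piBytes AS INTEGER):

    DEFINE VARIABLE i      AS INTEGER NO-UNDO.
    DEFINE VARIABLE iBytes AS INTEGER NO-UNDO.

    /* Measure, in bytes, ever-longer prefixes counted in characters. */
    DO i = 1 TO LENGTH(pcText, "CHARACTER"):
        iBytes = LENGTH(SUBSTRING(pcText, 1, i, "CHARACTER"), "RAW").
        IF iBytes >= piBytes THEN LEAVE.
    END.

    /* A prefix of whole characters must occupy exactly piBytes bytes. */
    RETURN iBytes = piBytes.
END FUNCTION.

DEFINE VARIABLE char-over AS CHARACTER NO-UNDO FORMAT "X(8)".
char-over = "abcefg". /* substitute a string that contains a double-byte character */

IF NOT endsOnCharBoundary(char-over, 4) THEN
    MESSAGE "A 4-byte overlay at position 1 would split a character.".
```

For a string made of three single-byte characters followed by a double-byte character, as in the example above, the first four characters occupy five bytes. No prefix of whole characters is exactly four bytes long, so the check fails and the message appears.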

Testing character strings for multi-byte characters

To determine whether a character string contains multi-byte characters, use the LENGTH function, which returns the number of characters, bytes, or columns in a string. The syntax is:

Syntax
LENGTH ( { string [ , type ] | raw-expression } )

string

A character expression. The specified string can contain double-byte characters.

type

A character expression that indicates whether you want the length of a string in character units, bytes, or columns. A double-byte character registers as one character unit. The default unit of measurement is character units.

There are three valid types: CHARACTER, RAW, and COLUMN. The expression "CHARACTER" indicates that the length is measured in characters, including double-byte characters. The expression "RAW" indicates that the length is measured in bytes. The expression "COLUMN" indicates that the length is measured in columns. If you specify the type as a constant expression, OpenEdge validates the type specification at compile time. If you specify the type as a variable expression, OpenEdge validates the type specification at run time.

raw-expression

A function or variable name that returns a raw value.
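
As a brief illustration of the two ways to supply type, the following sketch passes the unit once as a constant and once through a variable. The variable name cUnit is an assumption made for this illustration only.

```abl
DEFINE VARIABLE mychar AS CHARACTER NO-UNDO INITIAL "123".
DEFINE VARIABLE cUnit  AS CHARACTER NO-UNDO INITIAL "RAW". /* hypothetical name */

DISPLAY
    LENGTH(mychar, "CHARACTER") LABEL "Constant type" /* type checked at compile time */
    LENGTH(mychar, cUnit)       LABEL "Variable type" /* type checked at run time */
    WITH 1 COLUMN.
```

For the single-byte string "123", both calls should return 3.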

To use the technique, call LENGTH twice: once with the CHARACTER option, which returns the length in characters, and once with the RAW option, which returns the length in bytes. Then, compare the two lengths. If they are equal, the string contains only single-byte characters; otherwise, the string contains at least one multi-byte character.

The following examples illustrate the technique. The first example tests a character string consisting of one double-byte character. Since the length of the string in characters (1) does not match the length in bytes (2), the example displays Multi-byte characters in the string:

DEFINE VARIABLE mychar AS CHARACTER INITIAL "". /* the initial value is one double-byte character (not shown) */
IF LENGTH(mychar,"CHARACTER") = LENGTH(mychar,"RAW") 
THEN DISPLAY "No multi-byte characters in the string". 
ELSE DISPLAY "Multi-byte characters in the string". 

The second example tests a character string consisting of three single-byte characters. Since the length of the string in characters (3) matches the length in bytes (3), this example displays No multi-byte characters in the string:

DEFINE VARIABLE mychar AS CHARACTER INITIAL "123". 
IF LENGTH(mychar,"CHARACTER") = LENGTH(mychar,"RAW") 
THEN DISPLAY "No multi-byte characters in the string". 
ELSE DISPLAY "Multi-byte characters in the string". 
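
If the same test is needed in several places, the comparison can be wrapped in a user-defined function. This is a sketch only: the function name hasMultiByte and its parameter are hypothetical, and the unknown value (?) is not handled.

```abl
/* Hypothetical helper: YES when pcText contains at least one multi-byte character. */
FUNCTION hasMultiByte RETURNS LOGICAL (INPUT pcText AS CHARACTER):
    /* The byte count exceeds the character count only when some
       character occupies more than one byte. */
    RETURN LENGTH(pcText, "RAW") > LENGTH(pcText, "CHARACTER").
END FUNCTION.

DISPLAY hasMultiByte("123") LABEL "Multi-byte".
```

For the string "123", the function returns NO, which matches the second example above.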

Testing for a lead-byte value

The next technique involves testing a byte for a lead-byte value. Lead bytes (and trail bytes) often have special values to distinguish them. Table 8–5 lists the lead-byte and trail-byte values for the multi-byte code pages OpenEdge supports.

Table 8–5: Lead byte and trail byte values

Code page     Language or standard    Lead-byte values     Trail-byte values
-----------   ---------------------   ------------------   ------------------
BIG-5         Traditional Chinese     161 through 254      64 through 126
                                                           161 through 254
CP949         Korean                  129 through 254      65 through 90
                                                           97 through 122
                                                           129 through 254
CP950         Traditional Chinese     129 through 254      64 through 126
                                                           128 through 254
CP1361        Korean                  132 through 211      65 through 127
                                      216 through 222      129 through 254
                                      224 through 249
EUCJIS        Japanese                142                  161 through 254
                                      164 through 254
GB2312        Simplified Chinese      161 through 254      161 through 254
GB18030 (1)   Extended Chinese
KSC5601       Korean                  161 through 254      161 through 254
SHIFT-JIS     Japanese                129 through 159      64 through 126
                                      224 through 252      128 through 252
UTF-8         Unicode                 193 through 239      128 through 191

  1. The GB18030 code page is a multi-byte code page, consisting of one-, two-, and four-byte characters, that extends the GB2312 code page and includes all characters defined in Unicode. Unlike most multi-byte code pages that OpenEdge supports, you cannot use the lead byte of multi-byte characters in the GB18030 code page to determine the character's length. Progress uses the International Components for Unicode (ICU) library to convert characters between the GB18030 code page and Unicode within the OpenEdge GUI client.

You cannot always assume a byte with a lead-byte value is a lead byte, or a byte with a trail-byte value is a trail byte. This is because the possible values for trail bytes overlap those of lead bytes and single bytes. For example, the value 164 can correspond to a lead byte or a trail byte. To determine which it is, you must inspect the string.

To determine if a byte has a lead-byte value, use the IS-LEAD-BYTE function, which evaluates a character expression and returns YES if the first byte of the first character of the character string has a value within the range permitted for lead bytes. Otherwise, IS-LEAD-BYTE returns NO. IS-LEAD-BYTE has the following syntax:

Syntax
IS-LEAD-BYTE ( string ) 

string

A character expression (a constant, field name, variable name, or any combination of these) whose value is a character.

In the following example, IS-LEAD-BYTE examines a string whose first character is single byte. Because the value of that character's only byte is outside the range permitted for lead bytes, IS-LEAD-BYTE returns NO, and the example displays Lead: no:

DEFINE VARIABLE lead AS LOGICAL. 
lead = IS-LEAD-BYTE("xy"). 
DISPLAY lead WITH 1 COLUMN. 

The following example is identical to the preceding example except that the first character of the string is double byte. Because the first byte of that character has a value within the range permitted for lead bytes, IS-LEAD-BYTE returns YES, and the example displays Lead: yes:

DEFINE VARIABLE lead AS LOGICAL. 
lead = IS-LEAD-BYTE("xy"). /* the string begins with a double-byte character (not shown) */
DISPLAY lead WITH 1 COLUMN. 
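
Because lead-byte and trail-byte values overlap, a byte is best tested only when it is known to begin a character. The following sketch, with hypothetical variable names, steps through a string one character at a time (using the CHARACTER unit) and applies IS-LEAD-BYTE to each character, so every byte tested is the first byte of a character. It assumes that SUBSTRING with the CHARACTER unit always returns whole characters.

```abl
DEFINE VARIABLE cText AS CHARACTER NO-UNDO INITIAL "xy". /* substitute any string */
DEFINE VARIABLE cChar AS CHARACTER NO-UNDO.
DEFINE VARIABLE i     AS INTEGER   NO-UNDO.
DEFINE VARIABLE iLead AS INTEGER   NO-UNDO.

/* Visit each character; count those whose first byte has a lead-byte value. */
DO i = 1 TO LENGTH(cText, "CHARACTER"):
    cChar = SUBSTRING(cText, i, 1, "CHARACTER").
    IF IS-LEAD-BYTE(cChar) THEN
        iLead = iLead + 1.
END.

DISPLAY iLead LABEL "Lead-byte characters".
```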


Copyright © 2005 Progress Software Corporation
www.progress.com
Voice: (781) 280-4000
Fax: (781) 280-4095